Abstract:
Markus Hohnerbach et al. recently published work optimizing the performance of the Tersoff potential, a computing scheme used in the LAMMPS molecular dynamics (MD) code. The optimized solver was implemented in three different computation precisions, namely single, mixed, and double. As a special activity of the Student Cluster Competition at the SC17 conference, we aimed to reproduce their experimental studies of the optimized solvers in terms of accuracy, performance, and scalability. We conducted our experiments on a cluster with Intel Xeon Gold 6130 CPUs and Nvidia Tesla P100 GPUs, while the original work was done on a cluster with Intel Xeon E5-2650 CPUs and K40 GPUs. Despite the differences in computing systems, we demonstrate that the claims of the original work can be successfully reproduced: the lower-precision solvers still achieve highly accurate results and exhibit good performance speedup and scalability. (C) 2018 Elsevier B.V. All rights reserved.
Abstract:
As a special activity of the Student Cluster Competition at the SC18 conference, we made an attempt to reproduce the performance evaluations of an optimized version of the earthquake simulation software SeisSol. Our experiments were conducted on a small-scale 4-node cluster with the Intel Skylake CPU architecture, while the performance numbers of the original work [19] were collected from a large-scale 3000-node supercomputer with the Intel Haswell CPU architecture. Both single-node performance and cluster scalability are presented and compared in this work. Overall, we also observed significant time-to-solution reductions from the optimized SeisSol code compared to the non-optimized baseline version. Specifically, the original work achieved a 13.6x speedup on a 221-million-element dataset, and we obtained a 4.77x speedup on a 125-thousand-element dataset. However, due to the differences in cluster size, dataset size, and CPU architecture, we did find different behaviors and trends in performance scalability and floating-point throughput in certain cases. Hence, this work shares our experiences and observations from our reproducibility activity and discusses our findings. (C) 2019 Elsevier B.V. All rights reserved.
Abstract:
With growing applications such as image recognition, speech recognition, ADAS, and AIoT, artificial intelligence (AI) frameworks are becoming popular in various industries. Currently, many neural network frameworks exist for executing AI models in applications, especially for training/inference purposes, including TensorFlow, Caffe, MXNet, PyTorch, Core ML, TensorFlow Lite, and NNAPI. With so many emerging frameworks, exchange formats are needed to move models between them. Given this requirement, the Khronos group created a standard draft known as the Neural Network Exchange Format (NNEF). However, because NNEF is new, conversion tools that would allow models to be exchanged among the various AI frameworks remain missing. In this work, we fill this gap by devising NNAPI conversion tools for NNEF. Our work allows NNEF to execute inference tasks on host and Android platforms and flexibly invokes the Android Neural Networks API (NNAPI) on the Android platform to speed up inference operations. We invoke NNAPI by dividing the input NNEF model into multiple submodels and letting NNAPI execute these submodels. We develop an algorithm named BFSelector, based on a classic breadth-first search with cost constraints, to determine how to divide the input model. Our preliminary experimental results show that our support of NNEF on NNAPI can obtain a speedup of 1.32 to 22.52 times over the baseline for API 27 and of 4.56 to 211 times over the baseline for API 28, where the baseline is the NNEF-to-Android platform conversion without invoking NNAPI. The experiment includes AI models such as LeNet, AlexNet, MobileNet_V1, MobileNet_V2, VGG-16, and VGG-19.
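The abstract only names BFSelector; as a rough illustration of the underlying idea (a breadth-first traversal that grows accelerator-eligible submodels until a cost budget is hit), the Python sketch below partitions a toy operator graph. The graph, cost function, and max_cost threshold are assumptions made for illustration, not the paper's actual implementation.

```python
from collections import deque

def bfs_partition(graph, entry, supported, op_cost, max_cost):
    """Greedily grow submodels during a breadth-first traversal.

    graph: dict mapping op name -> list of successor op names
    supported: set of ops the accelerator API (e.g., NNAPI) can run
    op_cost: dict mapping op name -> estimated cost (assumed metric)
    max_cost: cost budget per submodel (assumed constraint)
    """
    submodels, visited = [], set()
    queue = deque([entry])
    current, current_cost = [], 0
    while queue:
        op = queue.popleft()
        if op in visited:
            continue
        visited.add(op)
        if op in supported and current_cost + op_cost[op] <= max_cost:
            current.append(op)                 # keep growing this submodel
            current_cost += op_cost[op]
        else:
            if current:                        # close the submodel, start a new one
                submodels.append(current)
            current, current_cost = [], 0
            if op in supported:
                current, current_cost = [op], op_cost[op]
        queue.extend(graph.get(op, []))
    if current:
        submodels.append(current)
    return submodels

# Toy chain conv -> relu -> custom_op -> fc; custom_op is unsupported,
# so the model splits into two NNAPI submodels around it.
graph = {"conv": ["relu"], "relu": ["custom_op"], "custom_op": ["fc"], "fc": []}
print(bfs_partition(graph, "conv", {"conv", "relu", "fc"},
                    {"conv": 5, "relu": 1, "custom_op": 3, "fc": 4}, max_cost=10))
```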
Abstract:
Over the past decade, deep convolutional neural networks (CNN) have been widely embraced in various visual recognition applications owing to their extraordinary accuracy. However, their high computational complexity and excessive data storage present two challenges when designing CNN hardware. In this paper, we propose an energy-aware bit-serial streaming deep CNN accelerator to tackle these challenges. Using a ring streaming dataflow and an output reuse strategy to decrease data access, the amount of external DRAM access for the convolutional layers is reduced by 357.26x compared with the no-output-reuse case on AlexNet. We optimize hardware utilization and avoid unnecessary computations using the loop tiling technique and by mapping the strides of the convolutional layers to unit stride for computational performance enhancement. In addition, the bit-serial processing element (PE) is designed to use fewer bits in weights, which reduces both the amount of computation and external memory access. We evaluate our design using the well-known roofline model. The design space is explored to find the solution with the best computational performance and communication-to-computation (CTC) ratio. We achieve a 1.36x speedup and reduce energy consumption for external memory access by 41% compared with the design in [1]. The hardware implementation of our PE array architecture reaches an operating frequency of 119 MHz and consumes 68 k gates with a power consumption of 10.08 mW using TSMC 90-nm technology. Compared to the 15.4 MB of external memory access for Eyeriss [2] on the convolutional layers of AlexNet, our method requires only 4.36 MB of external memory access, dramatically reducing the costliest portion of power consumption.
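As a hedged illustration of the roofline-style design-space exploration mentioned above (not the paper's actual model or numbers), the sketch below computes attainable performance from a peak compute rate, external memory bandwidth, and a communication-to-computation (CTC) ratio; all parameter values are invented placeholders.

```python
def attainable_gops(peak_gops, bandwidth_gbs, ctc_ratio):
    """Roofline model: performance is bounded by either compute or memory.

    ctc_ratio: operations executed per byte moved to/from external memory
    (a higher CTC ratio means the design is more compute-bound).
    """
    return min(peak_gops, bandwidth_gbs * ctc_ratio)

# Explore a few hypothetical tiling choices, each with a different CTC ratio.
peak, bw = 100.0, 10.0          # assumed: 100 GOP/s peak, 10 GB/s DRAM bandwidth
for tile, ctc in [("tile_A", 2.0), ("tile_B", 8.0), ("tile_C", 20.0)]:
    print(tile, attainable_gops(peak, bw, ctc), "GOP/s")
```

Under these assumed numbers, tile_A is memory-bound while tile_C hits the compute roof, which is the kind of comparison such an exploration is meant to expose.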
Abstract:
Embedded multicore systems are playing increasingly important roles in the design of consumer electronics. The objective of such systems is to optimize both the performance and power characteristics of mobile devices. However, there are currently no power metrics supporting popular application design platforms (such as SID) that application developers use to develop their applications, which hinders their ability to optimize power consumption. In this article we present the design and experiments of a SID-based power-aware simulation framework for embedded multicore systems. The proposed power estimation flow includes two phases: IP-level power modeling and power-aware system simulation. The first phase employs PowerMixer(IP) to construct power models for the processor IP and other major IPs, while the second phase uses a power abstract interpretation method to summarize the simulation trace and then, with a CPE module, estimates the power consumption based on the summarized trace information and the input IP power models. In addition, a Manager component is devised to map each digital signal processor (DSP) component to a host thread and maintain access to shared resources, with the aim of sustaining simulation performance as the number of simulated DSP components increases. A power-profiling API is also provided so that embedded-software developers can tune the granularity of power profiling for a specific code section of the target application. We demonstrate via case studies and experiments how application developers can use our SID-based power simulator to optimize the power consumption of their applications. We characterize the power consumption of DSP applications with the DSPstone benchmark and discuss how compiler optimization levels with SIMD intrinsics influence performance and power consumption. A histogram application and an augmented-reality application based on human-face RMS (recognition, mining, and synthesis) are deployed as running examples on multicore systems to demonstrate how developers can use our power simulator in the optimization process and to illustrate different views of the power dissipation of applications.
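To make the two-phase flow more concrete, here is a minimal sketch of the second phase only: summing per-IP energy from a summarized activity trace and per-event energy numbers produced by an IP-level power model. All event names and energy values are hypothetical placeholders; the actual PowerMixer(IP)/CPE interfaces are not described in the abstract.

```python
# Minimal sketch of trace-based power estimation (phase 2 of the flow).
# The per-event energy figures would come from IP-level power models
# built in phase 1; the numbers below are placeholders, not measured data.
energy_per_event_nj = {"dsp_mac": 0.8, "dsp_load": 1.2, "bus_access": 2.5}

# Summarized simulation trace: event counts per IP over one profiling window.
trace_summary = {"dsp_mac": 1_000_000, "dsp_load": 250_000, "bus_access": 40_000}

window_seconds = 0.01  # assumed length of the profiling window

total_energy_nj = sum(trace_summary[e] * energy_per_event_nj[e] for e in trace_summary)
average_power_mw = total_energy_nj * 1e-9 / window_seconds * 1e3
print(f"estimated energy: {total_energy_nj * 1e-6:.2f} mJ, "
      f"average power: {average_power_mw:.1f} mW")
```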
Abstract:
The deep learning compiler tool Tensor Virtual Machine (TVM) has excellent deployment, compilation, and optimization capabilities, supported by the industry following the vigorous growth of neural networks (NN). It has a unified intermediate representation (IR) format that provides efficient compilation and portability. However, its high operational complexity requires considerable development effort. For beginners with programming backgrounds, a new and easy-to-use design approach is needed. This paper proposes a visual-concept approach that can execute artificial intelligence (AI) computing using block-based tools with AI knowledge. This research also develops a web-based NNBlocks framework that uses this approach to integrate with TVM. We conduct experiments to evaluate this approach: (1) interviewees assessed intuitiveness through hands-on operation; (2) interviewees answered a Usability Metric for User Experience (UMUX) questionnaire to evaluate usability; (3) interviewees answered a theme survey assessing the framework's significance; and (4) the impact on the system was evaluated through experiments. The results indicate that interviewees respond positively to the intuitiveness of the framework. The usability evaluation with UMUX meets expectations. The theme survey shows that the framework is significant for AI learning. The impact experiments indicate that the framework does not burden the system.
Abstract:
Minimization of power dissipation can be considered at the algorithmic, compiler, architectural, logic, and circuit levels. Recent research trends in multicore programming models suggest that parallel design patterns can be a solution for developing multicore applications. Because parallel design patterns exhibit regularity, we view this as a great opportunity to exploit power optimizations in the software layer. In this paper, we investigate compilers for low power with parallel design patterns on embedded multicore systems. We evaluate four major parallel design patterns: Pipe and Filter, MapReduce with Iterator, Puppeteer, and the Bulk Synchronous Parallel (BSP) model. Our work attempts to devise power optimization schemes in compilers by exploiting the recurring patterns of embedded multicore programs. The proposed optimization schemes are rate-based optimization for the Pipe and Filter pattern, early-exit power optimization for the MapReduce with Iterator pattern, a power-aware mapping algorithm for the Puppeteer pattern, and a multi-phase power gating scheme for the BSP pattern. In our experiments, real-world multicore applications are evaluated on a multicore power simulator, and significant power reductions are observed. We thus present a direction for power optimization in which one can further identify additional key design patterns for embedded multicore systems to explore power optimization opportunities via compilers.
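The following Python sketch illustrates the intuition behind a rate-based scheme for the Pipe and Filter pattern: because pipeline throughput is bounded by its slowest stage, faster stages can be slowed down (for example via frequency scaling) to roughly match that rate without hurting throughput. The stage rates and frequency levels here are invented for illustration and are not taken from the paper.

```python
def rate_based_frequencies(stage_rates, freq_levels):
    """Pick, per pipeline stage, the lowest frequency that still sustains
    the throughput of the slowest stage (the pipeline bottleneck).

    stage_rates: items/sec each stage achieves at the highest frequency
    freq_levels: available frequencies, as fractions of the maximum (0..1]
    """
    bottleneck = min(stage_rates)            # pipeline throughput bound
    chosen = []
    for rate in stage_rates:
        # lowest level whose scaled rate still meets the bottleneck rate
        level = min(l for l in freq_levels if rate * l >= bottleneck)
        chosen.append(level)
    return chosen

# Hypothetical 4-stage pipeline: stage 2 is the bottleneck at 40 items/sec,
# so the other stages can run at reduced frequencies.
print(rate_based_frequencies([120, 40, 80, 60], [0.25, 0.5, 0.75, 1.0]))
```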
Abstract:
Heterogeneous systems that consist of multiple CPUs and GPUs for high-performance computing are becoming increasingly popular, and OpenCL (Open Computing Language) provides a framework for writing programs that can be executed across heterogeneous devices. Compared with OpenCL 1.2, the new features of OpenCL 2.0 give developers better expressive power for programming heterogeneous computing environments. Currently, gem5-gpu, which combines gem5 and GPGPU-Sim, offers an experimental simulation environment for OpenCL. In gem5-gpu, gem5 only supports CUDA, although GPGPU-Sim can support OpenCL by compiling OpenCL kernel code to PTX code using real GPU drivers. However, this compilation flow in GPGPU-Sim only supports up to OpenCL 1.2. OpenCL 2.0 provides new features such as work-group built-in functions, extended atomic built-in functions, and device-side enqueue. To support OpenCL 2.0, the compiler must be extended to compile OpenCL 2.0 kernel code to PTX code. In this paper, we extend the low level virtual machine (LLVM) compiler with these features so that the emulator can support OpenCL 2.0. The proposed compiler creates local buffers for each work-group to enable work-group built-in functions and adds atomic built-in functions with memory order and memory scope for OpenCL 2.0 in NVPTX. Furthermore, APIs available in CUDA are utilized to implement the OpenCL 2.0 device-side enqueue kernel, and the compilation schemes in Clang are revised. The AMD APP SDK 3.0 and NTU OpenCL benchmarks are used to verify that the proposed compiler supports the features of OpenCL 2.0.
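As a language-neutral illustration of what a work-group built-in function such as a work-group reduction computes (not the paper's actual PTX lowering), the Python sketch below emulates the per-work-group semantics using one local buffer per work-group, mirroring the "local buffers for each work-group" idea described above. The data and work-group size are arbitrary.

```python
def work_group_reduce_add(global_data, work_group_size):
    """Emulate the semantics of OpenCL 2.0 work_group_reduce_add: every
    work-item observes the sum over its own work-group.

    A per-work-group local buffer holds the partial values, mimicking the
    local buffer the compiler allocates for work-group built-ins.
    """
    results = []
    for base in range(0, len(global_data), work_group_size):
        local_buffer = global_data[base:base + work_group_size]  # local memory
        group_sum = sum(local_buffer)                            # the reduction
        results.extend([group_sum] * len(local_buffer))          # broadcast to items
    return results

# 8 work-items with a work-group size of 4 -> two groups with sums 10 and 26.
print(work_group_reduce_add([1, 2, 3, 4, 5, 6, 7, 8], 4))
```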
Abstract:
Currently, GPGPU-Sim has become an important vehicle for academic architecture research. It is a cycle-accurate simulator that models contemporary graphics processing units. Machine learning is now widely used in applications such as self-driving cars, mobile devices, and medicine. With the popularity of mobile devices, mobile vendors are interested in porting machine learning or deep learning applications from computers to mobile devices, and Google has developed TensorFlow Lite and Android NNAPI for mobile and embedded devices. Since machine learning and deep learning are very computationally intensive, energy consumption has become a serious problem in mobile devices. Moreover, Moore's law cannot last forever; hence, the performance of mobile devices and of computers such as desktops or servers will see only limited enhancements in the foreseeable future. Therefore, performance and energy consumption are two issues of great concern. In this paper, we propose using a fixed-point data type, a low-power numerical representation that can reduce energy consumption and enhance performance in machine learning applications. We implemented the fixed-point instructions in the GPGPU-Sim simulator and observed the energy consumption and performance. Our evaluation demonstrates that, using the fixed-point instructions, the proposed design exhibits improved energy savings. Our experiments indicate that the fixed-point data type saves at least 14% of total GPU energy consumption compared with the floating-point data type.
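For readers unfamiliar with fixed-point arithmetic, the following Python sketch shows a generic Qm.n-style quantize/multiply/dequantize round trip. It only illustrates the data type itself; the Q-format chosen here is an assumption and the sketch does not represent the actual fixed-point instructions added to GPGPU-Sim.

```python
def to_fixed(x, frac_bits=8):
    """Quantize a real value to a signed fixed-point integer with
    frac_bits fractional bits (e.g., Q7.8 in a 16-bit container)."""
    return int(round(x * (1 << frac_bits)))

def fixed_mul(a, b, frac_bits=8):
    """Multiply two fixed-point values; the raw product carries
    2 * frac_bits fractional bits, so shift right to renormalize."""
    return (a * b) >> frac_bits

def to_float(x, frac_bits=8):
    """Convert a fixed-point integer back to a real value."""
    return x / (1 << frac_bits)

# Example: 1.5 * 2.25 computed in Q-format matches the floating-point result.
a, b = to_fixed(1.5), to_fixed(2.25)
print(to_float(fixed_mul(a, b)), "vs", 1.5 * 2.25)
```

Integer multiply-and-shift operations like these are generally cheaper in energy than floating-point multiplies, which is the intuition behind the reported savings.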
Abstract:
With the progress of medical science and technology and healthier eating habits, the proportion of the aged population is gradually increasing. Smart-home elderly care has thus attracted a lot of research attention in the recent past and remains an active issue. The Internet of Things (IoT) has been recognized as a key enabler for realizing smart-home elderly care. In the literature, a large number of IoT services/applications and platforms have been proposed for health/elderly care. However, research reports show that their adoption rate is very low, mainly because these packet-switched IoT data communication/networking approaches are too complicated from the perspective of older people. Statistical data reflect that circuit-switched voice telephony is still their most favored communication mechanism. In this paper, we aim to implement a circuit-switched approach to realizing smart-home elderly care IoT remote control based on our IoT platform, IoTtalk. To the best of our knowledge, our IoTtalk incorporating such capability is the first generic IoT platform in the literature that provides a telecommunication solution for smart-home elderly care. We design and implement an Android app, DTMFTalk, following the application development/execution framework of IoTtalk, to support IoT remote control via circuit-switched Dual-Tone Multi-Frequency (DTMF) signaling during a phone call conversation. Our real testbed deployment demonstrates that DTMFTalk can consistently and accurately recognize DTMF keys as long as the user holds the desired DTMF keys for a sufficient duration, justifying that DTMFTalk can serve as an effective approach to IoT remote control for smart-home elderly care.
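DTMF key recognition is commonly implemented with the Goertzel algorithm, which measures signal energy at the eight DTMF row/column frequencies; the Python sketch below demonstrates that standard technique on a synthetic tone. It is a generic illustration only and is not claimed to be DTMFTalk's actual recognition code.

```python
import math

ROW_FREQS = [697, 770, 852, 941]
COL_FREQS = [1209, 1336, 1477, 1633]
KEYS = [["1", "2", "3", "A"], ["4", "5", "6", "B"],
        ["7", "8", "9", "C"], ["*", "0", "#", "D"]]

def goertzel_power(samples, freq, sample_rate):
    """Energy at one frequency bin, computed with the Goertzel recurrence."""
    coeff = 2 * math.cos(2 * math.pi * freq / sample_rate)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2

def detect_key(samples, sample_rate=8000):
    """Pick the strongest row and column frequency and map them to a key."""
    row = max(ROW_FREQS, key=lambda f: goertzel_power(samples, f, sample_rate))
    col = max(COL_FREQS, key=lambda f: goertzel_power(samples, f, sample_rate))
    return KEYS[ROW_FREQS.index(row)][COL_FREQS.index(col)]

# Synthesize the tone for key "5" (770 Hz + 1336 Hz) and detect it; holding the
# tone for many periods (here 100 ms at 8 kHz) makes detection reliable.
fs, n = 8000, 800
tone = [math.sin(2 * math.pi * 770 * t / fs) + math.sin(2 * math.pi * 1336 * t / fs)
        for t in range(n)]
print(detect_key(tone, fs))
```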